DSC478 Project - Analysis (Mourad)

Credit Card Customers Churn

Helper Functions

| Variable | Type | Description |
|---|---|---|
| Clientnum | Num | Client number. Unique identifier for the customer holding the account |
| Attrition_Flag | Char | Internal event (customer activity) variable - if the account is closed then 1 else 0 |
| Customer_Age | Num | Demographic variable - Customer's Age in Years |
| Gender | Char | Demographic variable - M=Male, F=Female |
| Dependent_count | Num | Demographic variable - Number of dependents |
| Education_Level | Char | Demographic variable - Educational Qualification of the account holder (example: high school, college graduate, etc.) |
| Marital_Status | Char | Demographic variable - Married, Single, Unknown |
| Income_Category | Char | Demographic variable - Annual Income Category of the account holder (< $40K, $40K - 60K, $60K - $80K, $80K-$120K, > $120K, Unknown) |
| Card_Category | Char | Product Variable - Type of Card (Blue, Silver, Gold, Platinum) |
| Months_on_book | Num | Months on book (Time of Relationship) |
| Total_Relationship_Count | Num | Total no. of products held by the customer |
| Months_Inactive_12_mon | Num | No. of months inactive in the last 12 months |
| Contacts_Count_12_mon | Num | No. of Contacts in the last 12 months |
| Credit_Limit | Num | Credit Limit on the Credit Card |
| Total_Revolving_Bal | Num | Total Revolving Balance on the Credit Card |
| Avg_Open_To_Buy | Num | Open to Buy Credit Line (Average of last 12 months) |
| Total_Amt_Chng_Q4_Q1 | Num | Change in Transaction Amount (Q4 over Q1) |
| Total_Trans_Amt | Num | Total Transaction Amount (Last 12 months) |
| Total_Trans_Ct | Num | Total Transaction Count (Last 12 months) |
| Total_Ct_Chng_Q4_Q1 | Num | Change in Transaction Count (Q4 over Q1) |
| Avg_Utilization_Ratio | Num | Average Card Utilization Ratio |

Read Train dataset for analysis

The train dataset will be used for analysis.

I can see that the target is imbalanced: attrited customers make up about 16% of the records.

This means that accuracy scores for classification will be misleading, so we will need to consider F1, Recall, and Precision scores.

Since the target variable flags attrited customers, we can be more tolerant of Type 1 errors (false positives) than of Type 2 errors (false negatives): it is better to misclassify a customer as attrited and proactively approach them for retention than to ignore a customer who is about to leave and lose them.

Accordingly, while seeking the highest F1 score to achieve a reasonable balance, we should favor higher recall.
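To make the point about accuracy concrete, here is a minimal sketch on made-up labels (100 hypothetical customers, 16 of them attrited; these are not the project's data or results). It compares a model that never predicts attrition against a hypothetical model that trades some false alarms for better recall.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels: 16 attrited (1) and 84 existing (0) customers.
y_true = np.array([1] * 16 + [0] * 84)

# A useless model that predicts "existing customer" for everyone.
y_all_negative = np.zeros_like(y_true)

# A hypothetical model: catches 12 of the 16 attrited customers
# and raises 10 false alarms among the 84 existing customers.
y_model = np.array([1] * 12 + [0] * 4 + [1] * 10 + [0] * 74)


def report(y_pred):
    """Return accuracy, precision, recall and F1 for the attrited class."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
    }


baseline = report(y_all_negative)
model = report(y_model)
print("all-negative:", baseline)
print("hypothetical model:", model)
```

The all-negative baseline reaches 84% accuracy while finding no attrited customers at all, which is why recall and F1 are the more informative numbers here.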

Define Features Names (For Convenience)

Scale and Transform

Correlations

Let's check the correlations, showing only the highly correlated pairs with an absolute correlation greater than 0.6.
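One way to do this filtering is sketched below. The helper name `high_corr_pairs` and the data are assumptions for illustration: the columns borrow names from the dataset, but the values are synthetic.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(df, threshold=0.6):
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    pairs = pairs[pairs.abs() > threshold]
    # Order from strongest to weakest absolute correlation.
    return pairs.reindex(pairs.abs().sort_values(ascending=False).index)


# Synthetic stand-in data (not the real dataset).
rng = np.random.default_rng(0)
base = rng.normal(size=500)
df = pd.DataFrame({
    "credit_limit": base,
    "avg_open_to_buy": base + 0.1 * rng.normal(size=500),  # nearly a copy of credit_limit
    "customer_age": rng.normal(size=500),                  # unrelated to the other two
})

pairs = high_corr_pairs(df, threshold=0.6)
print(pairs)
```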

Pairplot

Let's check the pairplot to see whether there are any linear relations or unusual distributions (double-click the pairplot to read the details).

Interesting Relations

By inspecting correlations among the features, I noticed 4 interesting relations. I'll try to investigate further.

['months_on_book','customer_age']
['avg_open_to_buy','credit_limit']
['avg_utilization_ratio','total_revolving_bal']
['total_trans_ct','total_trans_amt']

Focus on total_trans_ct and total_trans_amt

Features Analysis

Visualizing the top tree levels

We can identify that total_trans_ct, total_revolving_bal, total_trans_amt are the top influencing features.

If the tree is not visible in the notebook, rerun the notebook or check the HTML notebook export.

Features by Importance

We can observe that total_trans_ct, total_revolving_bal, and total_trans_amt are the top 3 influential features for attrited customers. So basically, low customer interaction is the factor most strongly associated with customers leaving.
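As a rough illustration of how such a ranking can be read off a tree, the sketch below fits a shallow decision tree on synthetic data and sorts its impurity-based importances. The feature names (`signal`, `noise_a`, `noise_b`) and the data are made up; I am assuming a scikit-learn `DecisionTreeClassifier`, which may differ from the model used above.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data: only "signal" determines the label.
rng = np.random.default_rng(0)
n = 1000
X = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})
y = (X["signal"] < -1.0).astype(int)  # roughly 16% positives

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Impurity-based importances, sorted from most to least influential.
importances = pd.Series(tree.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)

# Text rendering of the top levels of the tree.
print(export_text(tree, feature_names=list(X.columns), max_depth=2))
```

Impurity-based importances only say which features the tree split on, not the direction of the effect, so the top tree levels are still worth reading alongside the ranking.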

Cluster Analysis

Trying KMeans

The outcome of the silhouette score is confusing when compared with the completeness_score and homogeneity_score measures. It seems that the clustering is capturing another aspect of the customers that is more influential than being attrited or not.
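The sketch below reproduces this kind of disagreement on synthetic data (not the project's data): the points form two clean blobs, but the binary label is assigned independently of them. The silhouette score only measures geometric separation, while homogeneity and completeness compare the clusters against the label.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import completeness_score, homogeneity_score, silhouette_score

rng = np.random.default_rng(0)

# Two well-separated blobs of 300 points each.
X = np.vstack([
    rng.normal(0, 1, size=(300, 2)),
    rng.normal(8, 1, size=(300, 2)),
])

# A binary label with about 16% positives, assigned independently of the blobs.
y = (rng.random(600) < 0.16).astype(int)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

sil = silhouette_score(X, labels)    # geometry only, ignores y
hom = homogeneity_score(y, labels)   # does each cluster hold a single class?
com = completeness_score(y, labels)  # does each class fall into a single cluster?
print(f"silhouette={sil:.3f} homogeneity={hom:.3f} completeness={com:.3f}")
```

A high silhouette combined with near-zero homogeneity and completeness means the clusters are real but unrelated to the label.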

Cluster assignments representation

I can't identify a coherent pattern in relation to attrited customers.

After forcing 2 clusters to match my target and getting really bad numbers, I will try to find the number of clusters that fits best.

The elbow-curve seems very smooth without an obvious break.
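For reference, an elbow curve can be computed as in the sketch below. It uses synthetic points with no cluster structure (not the project's data), which is one situation where the curve comes out smooth.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic points spread uniformly over the unit square: no real clusters.
rng = np.random.default_rng(0)
X = rng.random((500, 2))

ks = list(range(1, 9))
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in ks
]

# Inertia (within-cluster sum of squares) for each k; an elbow would show up
# as a sharp drop followed by a plateau.
for k, inertia in zip(ks, inertias):
    print(f"k={k}: inertia={inertia:.2f}")
```

Inertia keeps falling as k grows regardless of structure, so a curve without a clear break gives no strong reason to prefer one k over another.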

All scores are bad.

Trying AgglomerativeClustering

Still, I can't identify a coherent pattern in relation to attrited customers.

Applying KernelPCA

I will try different KernelPCA configurations to explore the cluster patterns of attrited customers in higher dimensions.

Trying to tune the Kernel Parameters to find the best completeness and homogeneity scores
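A tuning loop of this kind might look like the sketch below. The gamma grid, the RBF kernel, the KMeans step and the `make_circles` toy data are all assumptions for illustration, not the configuration used in this analysis.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA
from sklearn.metrics import completeness_score, homogeneity_score

# Toy data: two concentric circles, a classic non-linear case.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

results = []
for gamma in [0.1, 1.0, 5.0, 10.0]:  # hypothetical grid
    # Project with an RBF kernel, then cluster in the projected space.
    Z = KernelPCA(n_components=2, kernel="rbf", gamma=gamma).fit_transform(X)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)
    results.append({
        "gamma": gamma,
        "homogeneity": homogeneity_score(y, labels),
        "completeness": completeness_score(y, labels),
    })

# Pick the configuration with the best combined score.
best = max(results, key=lambda r: r["homogeneity"] + r["completeness"])
for r in results:
    print(r)
print("best:", best)
```

Scoring against the known label makes this a diagnostic rather than true unsupervised tuning, which is acceptable here because the goal is to check whether any projection separates attrited customers.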

It's clear that attrition is not the most influential profile characteristic; it seems that other profiling characteristics dominate the clusters. Since my variable of interest is attrited customers, I conclude that clustering is not appropriate for analyzing them.

In my next steps, I will try classification modeling to predict attrited customers.

End